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Abstract 

An algorithm is presented to compute isolated values of the divisor 
summatory function in O (n^Jtime and O (logn) space. The algorithm 
is elementary and uses a geometric approach of successive approximation 
combined with coordinate transformation. 

1 Introduction 

Consider the hyperbola from Dirichlet's divisor problem in an xy coordinate 
system: 

H(x,y) = xy 
= n 

The number of lattice points under the hyperbola can be thought of as the 
number of combinations of positive integers x and y such that their product is 
less than or equal to n: 

T(n)= 1 C 1 ) 

x,y:xy<n 

As such, the hyperbola also represents the divisor summatory function, or 
the sum of the number of divisors of all numbers less than or equal to n: 

= E 1= E 1 ( 2 ) 

d\x x,y:xy=n 



T(n) = £r( 



.el 

x=l 



One geometric algorithm is to sum columns of lattice points by choosing an axis 
and solving for the variable of the other axis: 

n 

X—l 
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which gives an O (n) algorithm. By using the symmetry of the hyperbola (and 
taking care to avoid double counting) we can do this even more efficiently: 



x=l 

which gives an O (n 1 / 2 ) algorithm and is in fact the standard method by which 
the divisor summatory function is computed. Our goal is to break this square- 
root barrier. 

In 1903, Vorono'i in pQ made the first significant advance since Dirichlet on 
the bound on error term for the divisor problem by decomposing the hyperbola 
into a series of non-overlapping triangles corresponding to tangent lines whose 
slopes are extended Farey neighbors. We will use a similar approach but where 
Vorono'i produced an exact expression for the error term and estimated its mag- 
nitude, we will instead produce an algorithm to determine a precise lattice count 
for an isolated value of n. 



2 Preliminaries 

It will be convenient to parameterize the sum in T (n) as: 

X2 

S (n,xix 2 ) = ^ pi (5) 

X — X\ 

so that: 

T(n) = S(n,l,n) = 2S(n,l, - [Vn\ 2 (6) 

We will also need to count lattice points in triangles. Consider an isosceles right 
triangle (0, 0) , , (i, 0), i an integer, excluding points on the bottom gives 
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This formula is also applicable to triangles of the form (0, 0) , (i, ai) , (i, (a — 1) i), 
a a positive integer. If we desire to to omit the lattice points on two sides, we 
can use A (i — 1) instead of A (i). 
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3 Region Processing 



Instead of addressing all of the lattice points, let us for the moment consider the 
sub-task of counting the lattice points in a curvilinear triangular region bounded 
by two tangent lines and a segment of the hyperbola. If we can approximate 
the hyperbola by a series of tangent lines, then the area below the lines is a 
simple polygon and can be calculated directly by decomposing the area into 
triangles. On the other hand, the region above the two lines can be handled 
by chopping off another triangle with a third tangent line which creates two 
smaller curvilinear triangular regions. 

We will now go about counting the lattice points in such region. We will do 
this by first transforming the region into a new coordinate system. This is very 
simple conceptually but there are a number of details to take care of in order to 
count lattice points accurately and efficiently. First, the tangent lines are not 
true tangent lines but are actually shifted to pass through the nearest lattice 
points. Because of this, tangent lines need to be "broken" on either side of 
the true tangent point in order to keep them under but close to the hyperbola. 
Second, the coordinate transformation turns our simple xy = n hyperbola into 
a general quadratic in two variables. Nevertheless, the recipe at a high level is 
simply "tangent, tangent, chop, recurse." 

This figure depicts a typical region in the xy coordinate system: 

























































1 






















































F 












































































































































































































































































1 


\\ 






















































\ 
















































































































































































































































































a i \ 


















































n 


H = - 








































































H(x. v) 






n 






































































































































































































































































































































































\ \ 














































\ \ 
























































































































































































































































































































































































































































































































r 


n 


-> 




pa 


































































>2 



































































































































































































































































































































Define two lines L\ and Li whose slopes when negated have positive integral 



3 



numerators a* and denominators 6»: 

-m, = | (7) 

The slopes are chosen to be Farey neighbors so that the determinant is unity: 

— ai b 2 — bi a 2 = 1 (9) 



ai 6i 

&2 &2 



and the slopes are rational numbers which we require to be in lowest terms and 
so we can assume gcd (oi, b\) — gcd (a 2 , 62) = 1- 

Assume further that the lines intersect at the lattice point Pq: 



(xo,yo) 

with x and y positive integers. 

Then the equations for the lines L\ and L 2 in point-slope form are: 



y-yo 

x — x 

y-yo 

x — x 

and converting to standard form: 

ai x + bi y 
a 2 x + b 2 y ■ 



Oi 

h 

02 

~b 2 



xo ax + y 61 
a 2 + y Q b 2 



and defining: 
we have: 



Cj = x ^ + y bi 



a\X + b\y 
a 2 x + b 2 y 



Cl 
C2 



Solving the definitions of C\ and c 2 for a; an d y give: 

x = ci b 2 - bi c 2 
yo = aic 2 - ci a 2 



(10) 

(11) 
(12) 



(13) 
(14) 

(15) 



(16) 
(17) 



(18) 
(19) 



Now observe that the xy lattice points form an alternate lattice relative to lines 
Li and L 2 : 
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Define a uv coordinate system with an origin of Pq, Li as the v axis and 
L 2 as the u axis and u and v increasing by one for each lattice point in the 
direction of the hyperbola. Then the conversion from the uv coordinates to xy 
coordinates is given by: 

x = Xo + b^u — biv (20) 
y = y - a 2 u + aiv (21) 

Substituting for Xq and j/o and rearranging gives: 

x = b 2 (u + ci) -&i (v + c 2 ) (22) 
y = ai (v + c 2 ) - a 2 (u + ci) (23) 

Solving these equations for u and v and substituting unity for the determinant 
provides the inverse conversion from xy coordinates to uv coordinates: 

u = a\ x + b-y y — C\ (24) 
v = a 2 x + b 2 y - c 2 (25) 

Because all quantities are integers, equations (|2"2")l , (f2"3"]). ([M]), (|23|) mean that 
each xy lattice point corresponds to a uv lattice point and vice versa. As a 
result, we can choose to count lattice points in cither xy coordinates or uv 
coordinates. 

Now we are ready to transform the hyperbola into the uv coordinate system 
by substituting for x and y in H (x, y) which gives: 

H(u,v) = (b 2 (u + ci) -b 1 (v + c 2 )) (ai {v + c 2 ) - a 2 {u + c x )) (26) 
= n 
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Let us choose a point P\ (0, h) on the v axis and a point P 2 (w, 0) on the u axis 
such that: 

H (u h ,h) = n 

H(w,v w ) = n 

0^u h < 1 

< v w < 1 

—dv/du (uh) > 

—du/dv(v w ) > 

or equivalently that the hyperbola is less than one unit away from the nearest 
axis at Pi and P2 and that the distance to the hyperbola increases as you 
approach the origin. 

With these constraints, the hyperbolic segment has the same basic shape as 
the full hyperbola: roughly tangent to the axes at the endpoints and strictly 
decreasing relative to either axis. 
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dinate system: 



We can now reformulate the number of lattice points in this region R as a 
function of the eight values that define it: 

S R = Sr (w, h, ai, bi, ci, a 2 , b 2 , c 2 ) (27) 

If H (w, 1) < n, then v w > 1 and we can remove the first lattice row: 

Sr = S R (w, h - l,a 1; 61, ci, a 2 , b 2 , c 2 + 1) + w (28) 
and if H (1, h) < n, then ut > 1 and we can remove the first lattice column: 

Sr = Sr(w— 1, h, ax, b\, C\ + 1, a 2 , 6 2 , c 2 ) + h (29) 
so that the conditions are satisificd. 
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At this point wc could count lattice points in the region bounded by the u 
and v axes and u = w and v = h using brute force: 



S R = £ 1 (30) 

More efficiently, if we had a formulas for u and v in terms each other, we could 
sum columns of lattice points: 

Sw (w) = YZ=i l v ( u )\ (31) 
S H (h)=E h v =ilU(v)\ (32) 

using whichever axis has fewer points, keeping in mind that it could be assym- 
metric. (Note that these summations are certain not to overcount because by 
our conditions V (u) < h for < u < w and U (v) < w for < v < h.) 
And so: 

S R (w,h,ai,bi,ci,a 2 ,b 2 ,c 2 ) = S w (33) 

= S H 

In fact we can derive formulas for u and v in terms of each other by solving 
H (u, v) — n (which when expanded is a general quadratic in two variables) for 
v or u. The resulting explicit formulas for v in terms of u and u in terms of v 
are: 



(aib 2 + ha2) (u + c{) - \ (u + c x ) 2 - 4 aihn 
V(u) = — -V C2 (34) 

(ai6 2 + bia 2 ) (v + c 2 ) - J (v + c 2 f - 4 a 2 b 2 n 
U ^ = 2^ Cl ^ 

(Note exchanging u for v results in the same formula with subscripts 1 and 2 
exchanged.) 

As a result we can compute the number of lattice points within the region 
using a method similar to the method usually used for the hyperbola as a whole. 
Our goal, however, it to subdivide the region into two smaller regions and process 
them recursively, only using manual counting at our discretion. To do so we 
need to remove an isosceles right triangle in the lower-left corner and what will 
be left are two sub-regions in the upper-left and lower-right. 

This figure shows the right triangle and the two sub-regions: 
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A diagonal with slope -1 in the uv coordinate system has a slope in the xy 
coordinate system that is the mediant of the slopes of lines L\ and L2: 



m 3 



ai + a 2 
61+62 



(36) 



So let us define: 



a 3 = ai + a 2 (37) 
6.3 = 61+62 (38) 

Then differentiating H (u,v) = n with respect to u and setting dv/du = — 1 
gives: 

(ai 6 2 + 61 a 2 + 2 a 2 6 2 ) (u + c x ) = (ai 6 2 + 61 a 2 + 2 ai 61) (w + c 2 ) (39) 

and the intersection of this line with H (u, v) = n gives the point Ptan on the 
hyperbola where the slope is equal to -1: 



Uta 



(ai 62 + 61 a 2 + 2 a\ 61) 



{a\ 6 2 + 61 02 + 2 a 2 62) 



a 3 63 



03 63 



Cl 



C2 



(40) 
(41) 



The equation of a line through this intersection and tangent to the hyperbola 
is then u + v — ittan + u tan which simplifies to: 



u + v = 2 ^/a 3 63 n — ci — c 2 (42) 
Next we need to find the pair of lattice points P4 (u^v^) and P5 (1*5,^5) such 
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that: 



«4 
"5 

—dv/du (114) 
—dv/du (1*5) 

V4, 

V5 



> 

> 
< 



1 





U4 
1 
1 

[V(u 4 )\ 
[V(u 5 )\ 



The derivative conditions ensure that the diagonal rays with slope — 1 pointing 
outward from P4 and P5 do not intersect the hyperbola. Setting 114 = |_WtanJ 
will satisfy the conditions as long as U4 7^ 0. 

Let the point at which the ray from P4 intersects the v axis be Pq (0, vq) and 
the point at which the ray from P5 intersects the u axis be P7 (u?, 0). Then: 



v 6 = U4 + v A 
Uj = u 5 + v 5 

A diagram of all the points defined so far: 



(43) 
(44) 
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Then the number of lattice points above the axes and inside the polygon N 
defined by points Po, Pq, P4, P5, P7 is 

S N = A (v 6 - 1) - A (v 6 - u 5 ) + A (u 7 ~ u 5 ) 



or 



A (w 6 — 1) + U4 if v e < u 7 
Sn = { A(« 6 -l) ifv 6 = u 7 

A (u 7 — 1) + w 5 if i>6 > u 7 



(45) 



because counting on reverse lattice diagonals starting at the origin we sum 
1 + 2 + . . . + (min (vq, u 7 ) — 1) plus a partial diagonal if the polygon is not a 
triangle. 



9 



Using the properties of Farey fractions observe that: 



ai bi 

a 3 b 3 

a 3 b 3 

a 2 b 2 



ai (bi + b 2 ) - h (ai + 02) = a-ih - b\a 2 = 
(ai + 02) b 2 - (bi + b 2 ) a 2 = aib 2 - b Y a 2 = 



01 61 

a 2 b 2 

ai h 

a 2 b 2 



so that mi and 7713 are also Farey neighbors and likewise for 771,3 an d m 2 . 

So we can define region R' to be the sub-region with P[ 
P4 and the region R" to be the sub- region with P" = P$,P { 
then the number of lattice points in the entire region is Sr = Sn 
or 



Pi,Pq = Pe,P 2 — 
-- P 7 ,P 2 = P2 and 

Sr' + Srh 



S R (w,h,a 1 ,b 1 ,c 1 ,a 2 ,b 2 ,c 2 ) = S N (46) 

+ S R (ti 4 , h - v 6 ,ai,bi,ci,a 3 , & 3 ,ci + c 2 + v 6 ) 
+ S R (w- u 7 ,v 5 ,a 3 ,b 3 ,ci +c 2 + u 7 ,a 2 ,b 2 ,c 2 ) . 

This recursive formula for the sum of the lattice points in a region in terms 
of the lattice points in its sub-regions allows us to use a divide and conquer 
approach to counting lattice points under the hyperbola. 



4 Top Level Processing 

Now let us return to the hyperbola as a whole. It should be clear that it is easy 
in xy coordinates to calculate y in terms of x by solving H (x, y) = n for y: 

T) 

Y(x) = - (47) 

We know that we only need to sum lattice points under the hyperbola up to 
Lv^J- The point y/n is in fact at the x = y axis of symmetry and so the 
slope at that point is exactly — 1. The next integral slope occurs at —2, so our 
first (and largest) region occurs between slopes —mi = 2 and — m 2 = 1. By 
processing adjacent integral slopes we will start in the middle and work our way 
back towards the origin. 

However, we cannot use the region method for the whole hyperbola because 
regions become smaller and smaller and eventually a region has a size w + h < 1 . 
We can find the point where this occurs by taking the second derivative of Y (x) 
with respect to x and setting it to unity. In other words, the point on the 
hyperbola where the rate of change in the slope exceeds one per lattice column, 
which is: 

x = V2n = 2 1/3 n 1/3 w 1.26n 1/3 (48) 

As a result there is no benefit in region processing the first O (77 1 / 3 ) 1 attice 
columns so we resort to the simple method to sum the lattice columns less than 
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(49) 



<2n 



Imax = \_\fn\ 
£/min \y (^max)J 

Si = S(n, l,Xm in - 1) 

where C\ > 1 is a constant to be chosen later. 

Next we need to account for the all the points on or below the first line which 
is a rectangle and a triangle: 

^2 — (^max -^min ~t~ 1) 2/min ~t~ ^ (^max -^min) 

Because all slopes in this section of the algorithm are whole integers, we have: 

cii = -mi 
h = 1 

Assume that we have point P 2 and value a-i from the previous iteration. For 
the first iteration we will have: 

X2 — ^max 
J/2 = 2/min 
a 2 = 1 

For all iterations: 

ai = a,2 + 1 

The x coordinate of the point on the hyperbola where the slope is equal to 
mi can be found by taking the derivative of Y (x) with respect to x, setting 
dy/dx — mi, and then solving for x: 

X ta .n = \ — (50) 

V °i 

Similar to processing a region (but now in xy coordinates), we now need two 
lattice points P4 (#4,2/4) and P5 (#5, 2/5) such that: 

X4. ^> X m i n 
X 5 = X4 + I 

— dy/dx (#4) > ai 

— dy/dx (xs) < ai 

2/4 = [Y(x 4 )\ 

2/5 = [Y(x 5 )\ 
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To meet these conditions we can set X4 — [x tan \ unless X4 < x m i n in which 
case we can manually count the lattice columns between x m - ln and X2 and cease 
iterating. If so, the remaining columns can be computed as: 



X2-1 

£ 

X— X m i 



L~ - («2 [X2 - x) + y 2 ) 



which is the number of lattice points below the hyperbola and above line Li 
over the interval [x m ; n ,X2). 

Now take line Li with slope —ai passing through Pi, lines L4 and L5 with 
slopes — a\ and passing through P4 and P 5 and then find the point P 6 where L4 
intersects x = x m i n and the point Pq where L5 intersects L 2 and the point P7 
where Li intersects x — x m j n and denote by Cj the y intercept of line Li. 
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Now add up the lattice points in the polygon M defined by the points 
Po, P7, P§, P4, P5 but above L 2 by adding the whole triangle corresponding to 
L4, subtracting the portion of it to the right of P4, and then adding back the 
triangle corresponding to L5 stating at P5: 



S 



A (c 4 - c 2 - x min ) - A (c 4 - c 2 - x 5 ) + A (c 5 - c 2 - x 5 ) (51) 

where if L4 is coincident with L5, the second two terms cancel each other out. 

Then choosing Pi = P5 (together with Po and P 2 ) and calculating the nec- 
essary quantities we have a region R and can now count lattice points using 
region processing: 

Sr. = S R (axxi + y 2 - c 5 , a 2 x 5 + y 5 - c 2 ,ai, l,c 5 ,a 2 , l,c 2 ) (52) 

so the total sum for this iteration is: 



Sa (ai) = Sm + Sr 
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Then we may advance to the next region by setting: 

2/2 = 2/4 

a' 2 = d\ 

Summing all interations gives 

a max 

5*4 = ^2 S A (a) . 

a=2 

Finally, the total number of lattice points under the hyperbola from 1 to a; max 
is 

S T = S(l,x max ) = Si + S 2 +S 3 + S4, (53) 
and therefore the final computation of the divisor summatory function is given 
by 

T (n) = 2S T - [V^\ 2 (54) 

5 Division-Free Counting 

Since we calculate Si using the traditional method and since the computation 
will consist entirely of Si when n < 4Cf , it is beneficial to have a faster method 
of performing this step, albeit by a constant factor. Denote by I = [log 2 (n)] 
the number of bits needed to represent n. We can avoid an Z-bit division in 
most iterations by using a Bresenham-style calculation (see [2]) and working 
backwards while computing an estimate of the result of the division based on 
the previous iteration. 

Define /3 (x) — [Y (x)\ , the finite difference #i (x) = (3 (x) — (3 (x + 1), and 
the second-order finite difference 82 (x) = Si (x) — 61 (x + 1). To check whether 
the value is correct we also need to keep track of the error. So defining the error 
e (x) = n — xj3 (x) = n — x \ n/x\ = nmodx gives 

e{x)~e{x + l) = (x + a) /3 (x + 1) - xf3 (x) 

= (x + 1) P (x + 1) - x (/3 (x + 1) + 61 (x + 1) + S 2 (x)) 
= (3(x + l)~ xSi (x + 1) - x5 2 (x) 

Introducing the intermediate quantity 7 (x) = (3 (x) — (x — 1) 5\ (x) and e (x) as 
the estimate of the error assuming 62 (x)=0 then 

e(x + 1) +7(2; + 1) 

X 

Si (x + 1) + S 2 (x) 
e (x) — XS2 (x) 
7 (x + 1) + 2Si (x) - xS 2 (x) 
[3(x + l)+Si (x). 



i(x) = 

S 2 (x) = 

Si(x) = 

e(x) = 

7 (x) = 

(3(x) = 
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Over the range x\ < x < x 2 these integer quantites are bounded in size by 
x,s (x) , \i (x)\ < X2,\'f (x)\ < max (2n/ (x\ + xi) ,x 2 ), f3 (x) < n/x\, 81 (x) < 
n/x\+l,\5 2 (x)\ < 2n/x\ + 2. 

For v^n < x < y/n, 5 2 (x) £ {-1,0, 1,2} and so 



i(x) 



2 ife(a;)>2x; 

1 if x < i (x) < 2x; 

-1 ife(a;)<0; 

otherwise; 



and thus /3 (x) , 7 (x) , 5\ (x) , e (x) can be computed from /3 (a; + 1) , 7 (x + 1) , Si (x + 1) , e (x + 1) 

using only addition and subtraction of i^-bit quantities except f3 (x) which is 

|Z bits. Note that e (x) > 2x is very rare over this range and if e (x) > 3x, it 

means that x < \/2n. For n 1 / 6 < x < y^2n we can add the modest division 

[e (x) jx\ between two i/-bit values, 7 (x) and <5i (x) grow to |Z bits and j3 (x) 

grows to |^ bits. For x < n 1 / 6 we can sum using ordinary division. 



6 Algorithms 



In this section we present a series of algorithms based on the previous sections. 
The short-hand notation F (x) : expression signifies a functional value that 
remains unevaluated until referenced. 

The first algorithm is a straightforward version of the basic successive ap- 
proximation method. A literal implementation based on this description will 
offer many opportunities for optimization. Various formulas have been slightly 
modified so that the entire algorithm can be implemented using only unsigned 
multi-precision integer arithmetic. The operations required arc addition, sub- 
traction, multiplication, floor division, floor square root, ceiling square root, and 
ceiling cube root. If any of the root operations are not available, they may be 
implemented using Newton's method. 
Algorithm 1 
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Ex=r m \ji/x\ - (a 2 (x 2 -x) + y 2 ) 



■0 



Inputs: n > 0, d w 10, C 2 
A(i) : * (i + 1) /2 
Si () : El= X r n InM 
S*Q: 
Ss():_ 

S M () : A (c 4 - c 2 - x min ) - A (c 4 - c 2 - x 5 ) + A (c 5 - c 2 - x 5 ) 

Imax <~ [Vn\ ,2/min <~ [n/x max \ , X min <~ Hlin ( \C\ \^2n\ , X max ) 

s <- 0, a 2 4- 1, x 2 «- x max , y 2 <- y min , c 2 <- a 2 x 2 + y 2 
loop 
a\ 

X4 



■ CL2 + 1 

£5 -s— X4 + 1 , J/5 - 



[n/x 5 \ ,C5 «- aix 5 + j/5 



if x 4 < x m in then exit loop end if 
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s 4- s + S M () + S R {aix 2 +V2- c 5 ,a 2 x 5 + y 5 - c 2 ,ai, l,c 5 ,a 2 , l,c 2 ) 
a 2 «- ai,x 2 <- x 4 ,y 2 «- y 4 ,c 2 «- c 4 
end loop 

S <- S + & () + s 2 () + s 3 () 

return 2s - x^ ax 

function Sr (w, h, ai, bi, ci, a 2 , 6 2 , c 2 ) 
A(i) :*(i + l)/2 

i? (u, w) : (6 2 (u + ci) - h (v + c 2 )) (oi (u + c 2 ) - a 2 (u + a)) 



(7ta„ () : 

Vfioor (u) : 
C 2 

Vfioor (u) : 

c 2 



(oi 6 2 + bi a 2 + 2 ai h) 2 n/ (a 3 6 3 )J 



ci 



(oi 6 2 + 6i a 2 ) (u + ci) - 
(oi 6 2 + 6i a 2 ) (v + c 2 ) - 



/(2ai6i) 



\J {u + ci) 2 - 4 ai &i n 



\j iy + c 2 ) 2 - 4 a 2 6 2 n ^ / (2a 2 fo 2 



^ ():E^r^oor («) 
fe():E:=? ^floor («) 

SV () : A («e - 1) - A (u 6 - u 5 ) + A (u 7 - u 5 ) 
s 4- 0, a 3 «- oi + a 2 , 63 <- 61 + b 2 

if /1 > A H (w, 1) < n then s<— s + w, c 2 <— c 2 + 1, /i <h- /i — 1 end if 
if to > A H (1, /i) < n then s <— s + h, c\ 4- Ci + 1, w 4— w — 1 end if 
if w < C 2 then return s + Sw () end if 
if h < C 2 then return s + Sh () end if 

Ui <~ C/tan () , Vi <- Vfl oor (lt 4 ) , W 5 «- U 4 + 1 , U 5 «- Vfi oor (u 5 ) 
«6 "4 + f4j U 7 4- U 6 + V 6 
S 4- S + S N () 

s «- s + Sr (u 4 , /i - t> 6 ,ai,6i,ci,a 3 ,6 3 ,ci + c 2 + u 6 ) 
s «- s + 5 fl (tu - u 7 ,W5,a 3 ,6 3 ,C! + c 2 + u 7 ,a 2 ,b 2 ,c 2 ) 
return s 
end function 



The next algorithm gives a flavor for the optimizations that are available. It 
computes the manual summation of a small region over u or v using a handful 
of additions, one square root and one division per lattice column. A similar 
technique can be used to compute Vfl 00 r for the adjacent values w 4 and u 5 . 
Making this portion of the computation faster favors larger values of C 2 , the 
cutoff for small regions. An analogy is that this step is faster for small regions 
in the same way that an insertion sort is faster than a quicksort for small arrays 
and the break even point can be determined experimentally. 
Algorithm 2 

S w () : 5*7 {w, ci, c 2 , aib 2 + 6i0 2 , 2aih) 
S H () : Si (h, c 2 , ci, a : b 2 + ha 2 2a 2 b 2 ) 
function Si (i max ,pi,p 2 , q, r) 

s 4- 0, A 4- p\ - 2rn, B 4- piq, C 4- 2pi - 1 
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for i = 1, . . 

C a-C- 

s «- s + 

end for 
return s — (i 
end function 



'max - 1 do 

2,Aa- A + C,B A- B + q 

in ~ ^ 



1)P2 



The next algorithm formalizes the steps of the division-free counting method 
which can be used for the summation Si. Whether this is actually faster 
depends on many things but for example if n < 2 94 , then (3, 5, \ j\ , |e| < 2 63 for 
2 32 < x < 2 47 and if signed 64-bit addition is a single-cycle operation, then 
a computation of (3 using this method is about ten cycles vs. say a hundred 
cycles for a single multi-precision division. 
Algorithm 3 

Si () : S Q (l,x min - 1) 
function Sq (xi,x 2 ) 

s A- 0,x A- X2,0 A- [n/ (x + 1)J , £ A- nmod(i + 1) ,6 A- ["/^J — 

/3,7 A- P - xS 

while x > x\ do 

£ A- £ + 7 

if e > x then 

(5<— 5+1,7 A— 7 — X, £ <h- £ — X 

if e > x then 

5 A- 5 + 1,7 A- 7 — x,e A- e — x 

if e > x then exit while end if 
end if 
else if e < then 

5 A- 5 — 1,7 A- 7 + X,E A- E + X 

end if 

7<-7 + 2<5, P A- P + 5, s a- s + /3, x A- x - 1 
end while 

e <— nmod (x +1) ,5 A- [n/x\ — (3, 7 A— /3 — x<5 
while x > xi do 

£ <— £ + 7, (5 2 <— [s/x\ ,5 a- 5 + 6 2 ,£ A- £ — X(5 2 

7^-7 + 25- x<5 2 , P A- P + 5, s A- s + P,x A- x - 1 
end while 
while x > xi do 

s A- s + [n/x\ , x A- x — 1 
end while 
return s 
end function 
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7 Time and Space Complexity 



Now we present an analysis of the runtime behavior of algorithm. 

Theorem 1 The time complexity of algorithm J^j when computing T (n) is 
O (n 1 / 3 ) and the space complexity is O (logn). 

Before we start, we realize that because x m i n = O (n 1 / 3 ) and we handle the 
values of 1 < x < x m in manually, the algorithm is at best O (n 1 / 3 ) . In this 
section we desire to show that the rest of the computation is at worst O (n 1 / 3 ) 
so that this lower bound holds for the entire computation. 

Our first task is to count and size all the top-level regions. We process one 
top level region for each integral slope —a from —1 to the slope at x m - m . The 
value for a at each value of x is given by: 

a = ~Y{x) = ^ (55) 
ax ar 



and: 



X{a)^J- a (56) 



Choosing C\ = 1 so that x m ; n = v2n, then the highest value of a processed is: 

(57) 



.r 



2 

min 



2 2 / 3 



so there are O (n 1/3 ) top level regions. 

How big is each top level region? The change in x per unit change in a is 
dx/da and so: 

d n 1 / 2 

Assume for the moment that the number of total regions visited while processing 
a region of size A is: 

N{A) = 0(A G ) 

noting that the cost of processing a region (excluding the cost of processing its 
sub-regions) is O (1) and so the total number of regions is representative of the 
total cost. 

Now we sum the number of sub-regions processed across all top level region: 



a=2 




iVtotal =J2 N ( A ) = 
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We can classify three cases depending on the value of G because the outcome 
of the integration depends on the final exponent of a: 

f 0(^/3) ifG<2/3; 
A/total = { O (n 1 / 3 log n) if G = 2/3; 

{ {n G ' 2 ) ifG>2/3. 

(Note that we cannot get below O (n 1 / 3 ) even if G = because we have at least 
flmax = O (n 1 / 3 ) top level regions.) 

Now let us analyze the exponent in N (A) . In order to determine the number 
of regions encountered in the course of processing a region of size A, we need to 
analyze the recursion depth. The recursion will terminate when w or h is unity 
because by our conditions it is then impossible for the region to contain any 
more lattice points. Our next task is to measure the size of such a region and so 
we need to know how many x lattice columns that terminal region represents. 

We can use the transformation between uv and xy coordinates given by (|20p 
to compute the difference between the x coordinates of P 2 at (1,0) and P\ at 
(0, 1), assuming the smallest case with w = h = 1: 

Ax = X2-X1+I >(x + l-b 2 -0- h)-(x + • 6 2 - 1 • &i)+l = h+b 2 +l > 61+62 

(60) 

so the size of a terminal region is greater than the sum of the denominators of 
the slopes of the two lines that define it. 

Each time we recurse into two new regions we add a new extended Farey 
fraction that is the mediant of the two slopes for the outer region. As a result, we 
perform a partial traversal of a Stern-Brocot tree, doubling the number of nodes 
at each level. However, for our current purposes we can ignore the numerators 
because we are interested in the sum of denominators. Because regions cannot 
overlap, this means that the sum of the denominators at the deepest level of the 
tree cannot exceed the size of the first region and that only denominators affect 
the recursion depth. 

Next we need to derive a formula for the sum of the denominators of a partial 
Stern-Brocot tree of depth D, For example, if the first node (01/61,02/62) is 
(2/1,1/1), the next two nodes are (2/1,3/2) and (3/2,1/1). Continuing and 
ignoring numerators we have the following (61, 62) tree: 

(1,1) 

/ \ 
(1,2) (2,1) 

/ \ / \ 
(1,3) (3,2) (2,3) (3,1) 




At each new level we have twice as many nodes and half of the numbers are 
duplicated from the previous level and the other half of the numbers are the 
sum of numbers of their parent node. Since each parent's sum contributes to 
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exactly two numbers in the children, the sum of the denominators at each level 
is triple the sum of the previous level. So staring with 1 + 1 = 2 leads to the 
sequence 2, 6, 18, 54, . . ., and denoting by f2 the set of terminal regions, the sum 
at depth D is therefore 

A > b i+ b 2 = 23 d . 

R-.Ren 

Because the number of terminal regions is |f2| = 2 D , we can now place a bound 
on |f2| in terms of A: 

£\ l/log 2 3 



|0| < 

V 2 J 

Finally, since the total number of regions is 1 + 2 + 4 + . . . + |fi| = J2f=i 2 8 , the 
number of regions as a function of the size A is 

N(A)=2\Q\ -l = o(A 1/los ^ 3 ) (61) 

and therefore G = 1/ log 2 3. 

Since l/log 2 3 w 0.63, this means that G < 2/3 and the proof that the 
overall time complexity of the algorithm is O (n 1 / 3 ) is complete. 

The space complexity is simply our recursion depth which can be at most 
O (logn). 



8 Higher-Order Divisor Sums 

The two-dimensional hyperbola and the functions r (n) and T (n) can be general- 
ized to higher dimensions. Using this notation r (n) = t 2 (n) and T (n) = T 2 (n). 
Then the divisor sum T3 (n), the summatory function for T3 (x) = J2 a bc=x ^' can 
be computed by summing under the three-dimensional hyperbola 

™- s^-tsLsJ-t^LfJ)- 

x,y,z:xyz<n z — 1 x—1 z—1 

Again using the symmetry of this hyperbola we can restrict the outer summation 
to \/n by counting nested "shells" , and avoiding double and triple counting, we 
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get 
T 3 (n) 




-a -2 +1 



where in the last step we use the identity £* =1 3 (z 2 - z)+l = 3 (fc (fc + 1) (2fc + 1) /6 - k (k + 1) /2)+ 
k = k 3 . Since S^n,;^, Lv 7 ™]) i s a partial result in the calculation of T (n), it is 
also has O (n 1 / 3 ) time complexity when using Algorithm [6]. As a result, we 
can calculate T3 (n) in 



£0(7 



1/3 



' 1/3 n 1 / 3 



,1/3 



(n 5/9 ) , 



a modest improvement over O (n 2 / 3 ) using a direct double summation. Similar 
derivations give O (n 2 / 3 ) for T 4 (n) and O (n 11 / 15 ) for T 5 (n) or O (n 1 - 4 /^* 3 )) 
for Tfc (n) in general. 



9 Remarks 



It would be possible to simplify the algorithm somewhat by removing the 
distinction between top level regions and region processing itself by starting 
with the region defined by (1/0,1/1). The reason for the current assymetry 
is two-fold. First, some of the solutions to the equations are degenerate when 
a.ibi = and would require special handling anyway. Second, and perhaps more 
importantly, we can also capitalize on the simpler xy coordinate system where 
possible. 

The two major sections of the algorithm, Si and S4, are easily paralleliz- 
able. The section S\ can divide summation batches to different processors. The 
section S4 can be revised to use a work queue of regions instead of recursion. 
During region processing, one region can be enqueued and the other processed 
iteratively. Available processors can dequeue regions that need to be processed. 
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terms in the T3 (n) 



In fact it turns out that the Syn/z, z+l 
summation skip over the problematic first O ^(n/z) 1 ^ 3 ^ columns by the time z 

reaches n 1 / 4 and then start eroding away the smallest regions as z approaches 
n 1 / 3 . Modifying the method slightly and then computing the time complexity 
of these two portions separately and allowing a max to decline appropriately we 
would achieve O (n 1 / 2 logn) for T3 (n) if we could prove that G = 1/2. In any 
case, using G = l/log 2 3 at least gives us O (n 5 / 9_c+£ ) for some c > 0. 



10 Related Work 

In [3] , Galway presents an improved sieving algorithm that also features region 
decomposition based on extended Farey fractions as well as coordinate trans- 
formation. In [4] , applications for the divisor summatory are function presented 
including computing the parity of tt (x) , the prime counting function, as well as a 
sketch for a different O (71 1 / 3 ) algorithm. In [5], the parity of the prime counting 
function is studied more closely and several related algorithms are developed. 
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